For data \(\{(\boldsymbol{x}_i, y_i); i = 1, \ldots, N\}\) with \(\boldsymbol{x}_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\),
\[y_i = f(\boldsymbol{x}_i) + \varepsilon_i\]
We model the underlying function with a Gaussian-process prior,
\[f(\cdot) \sim \mathcal{GP} \left( \mu(\boldsymbol{x};\boldsymbol\theta_\mu), k(\boldsymbol{x}, \boldsymbol{x'}; \boldsymbol{\theta}_k) \right)\]
where \(\mu(\cdot)\) is the mean function and \(k(\cdot)\) is the covariance kernel function, with hyperparameters \(\boldsymbol\theta_\mu\) and \(\boldsymbol\theta_k\), respectively.
Assume the observational noise is Gaussian: \[\varepsilon \sim \mathcal{N}(0, \sigma^2_\textrm{n})\]
which makes \(\boldsymbol{y}\) Gaussian as well, keeping the model analytically tractable.
If the noise is assumed Gaussian, the kernel hyperparameters can be found via maximum a posteriori estimation, using the marginal likelihood:
\[\mathrm p(\boldsymbol Y|\boldsymbol X) = \int \mathrm p(\boldsymbol{Y} | f, \boldsymbol{X})\cdot \mathrm p(f|\boldsymbol X) \mathrm{d}f \]
\[\begin{multline} \log \mathrm p(\boldsymbol{Y} | \boldsymbol{X}, \boldsymbol{\theta}) = -\frac{1}{2}(\boldsymbol{Y} - \boldsymbol{\mu})^\top (\boldsymbol{K} + \sigma^2_\textrm{n} \boldsymbol{I})^{-1}(\boldsymbol Y - \boldsymbol\mu) \\- \frac{1}{2}\log(|\boldsymbol{K} + \sigma^2_\textrm{n}\boldsymbol I|) - \frac{N}{2}\log(2\pi) \end{multline}\]

But what if the noise is not Gaussian?
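Before turning to non-Gaussian noise: the log marginal likelihood above is cheap to evaluate directly. A minimal NumPy sketch, assuming a zero-mean GP with a squared-exponential kernel and illustrative hyperparameter values:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; hyperparameter values are illustrative."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def log_marginal_likelihood(X, y, noise_var=0.1, mean=0.0):
    """log p(Y | X, theta) for a GP with Gaussian observation noise."""
    N = len(y)
    K = rbf_kernel(X, X) + noise_var * np.eye(N)
    L = np.linalg.cholesky(K)  # stable alternative to a direct inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mean))
    return (-0.5 * (y - mean) @ alpha
            - np.sum(np.log(np.diag(L)))  # = 0.5 * log|K + sigma_n^2 I|
            - 0.5 * N * np.log(2 * np.pi))
```

In practice the hyperparameters would be chosen by maximising this quantity with a gradient-based optimiser.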
One option is heavier-tailed distributions, e.g., Student-t, Laplace, or Gaussian mixtures.
But these are usually not analytically tractable and are tricky to implement.
And outliers can still influence the model fit.
Proposal: trim extreme values according to some criterion and refit.
Premise: Trim a proportion of points with the largest absolute residuals after fitting a GP.
Parameters: \(\alpha_1\), \(\alpha_2\), \(n_\textrm{sh}\), and \(n_\textrm{cc}\).
The method comprises three phases:

1. Shrinking
2. Concentrating
3. Reweighting
1. Train a standard GP on the full sample and calculate a normalised residual for each predicted point.
2. Retrain the GP on the proportion \(\alpha\) of points with the smallest normalised residuals.
3. Repeat Step 2 \(n_\textrm{sh}\) times while shrinking \(\alpha: 1 \rightarrow \alpha_1\).
4. Repeat Step 2 \(n_\textrm{cc}\) times keeping \(\alpha\) constant.
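The shrinking and concentrating steps above can be sketched as follows. This is a toy illustration: a fixed-hyperparameter, zero-mean GP stands in for the retrained GP (the full method would re-optimise hyperparameters at each refit), and the linear \(\alpha\) schedule is an assumption.

```python
import numpy as np

def gp_fit_predict(Xtr, ytr, Xte, lengthscale=1.0, var=1.0, noise=0.1):
    """Minimal zero-mean GP for 1-D inputs with a fixed squared-exponential
    kernel; a stand-in for the retrained GP (hyperparameters assumed)."""
    def k(A, B):
        d2 = (A[:, None, 0] - B[None, :, 0]) ** 2
        return var * np.exp(-0.5 * d2 / lengthscale**2)
    K = k(Xtr, Xtr) + noise * np.eye(len(ytr))
    Ks = k(Xte, Xtr)
    mean = Ks @ np.linalg.solve(K, ytr)
    cov = k(Xte, Xte) - Ks @ np.linalg.solve(K, Ks.T)
    sd = np.sqrt(np.clip(np.diag(cov), 1e-12, None) + noise)
    return mean, sd

def shrink_concentrate(X, y, alpha1=0.5, n_sh=2, n_cc=2):
    """Shrinking then concentrating: repeatedly retrain on the
    alpha-fraction of points with the smallest normalised residuals."""
    keep = np.arange(len(y))                                # full sample
    alphas = list(np.linspace(1.0, alpha1, n_sh + 1)[1:])   # 1 -> alpha1
    alphas += [alpha1] * n_cc                               # hold constant
    for alpha in alphas:
        mu, sd = gp_fit_predict(X[keep], y[keep], X)
        r = np.abs(y - mu) / sd                 # normalised residuals
        m = int(np.ceil(alpha * len(y)))
        keep = np.argsort(r)[:m]                # smallest residuals survive
    return keep
```

On a toy sine curve with a few gross outliers injected, the returned index set excludes the outliers while retaining the \(\alpha_1\) proportion of clean points.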
The trimmed sample underestimates the true variance of the sample, so an extra correction factor is applied at each \(\alpha\).
Exclude points whose corrected normalised residuals \(r_i\) are amongst the largest \(100(1 - \alpha_2)\%\).
Retrain GP.
Finally, identify outliers as those observations whose corrected normalised residuals satisfy \(r_i > 3\).
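The reweighting step reduces to simple thresholding of the corrected normalised residuals. A sketch, where `flag_outliers` is a hypothetical helper and the residuals \(r_i\) are assumed precomputed with the correction factor already applied:

```python
import numpy as np

def flag_outliers(r, alpha2=0.975, threshold=3.0):
    """r: corrected normalised residuals, one per observation.
    Points in the largest (1 - alpha2) tail are excluded before the final
    refit; after it, |r_i| > threshold marks an outlier.  (Here both masks
    are computed from the same residual vector for illustration.)"""
    r = np.abs(np.asarray(r, dtype=float))
    cutoff = np.quantile(r, alpha2)   # keep the alpha2 smallest residuals
    refit_mask = r <= cutoff          # points used in the final refit
    outlier_mask = r > threshold      # declared outliers
    return refit_mask, outlier_mask
```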
\[f(x) = 0.3 + 0.4 x + 0.5 \sin(2.7 x) + \frac{1.1}{1+x^2} + \varepsilon\]
Performance is measured by the RMSE of the prediction residuals on each test set.
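The benchmark function and the RMSE metric can be written directly (the noise term \(\varepsilon\) is omitted from `f` here):

```python
import numpy as np

def f(x):
    """Noise-free benchmark; add eps ~ N(0, sigma_n^2) for observations."""
    return 0.3 + 0.4 * x + 0.5 * np.sin(2.7 * x) + 1.1 / (1 + x**2)

def rmse(y_true, y_pred):
    """Root-mean-square error of the prediction residuals."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```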
| Parameter | Suggested value |
|---|---|
| \(\alpha_1\) | 0.5 |
| \(n_\textrm{sh}\) | 2 |
| \(n_\textrm{cc}\) | 2 |
| \(\alpha_2\) | 0.975 |
The fit is biased by influential points near the boundary of the data coverage, where the local signal-to-noise ratio (SNR) is low.
Computation time is 3 to 5 times that of a standard GP.
Comments on Paper